Importing and extracting tables from PDF into R using “pdftools”

Introduction

Extracting tables from PDFs is often necessary when clients share data in PDF format. However, using free online tools to convert PDFs can pose a risk to confidential information. Fortunately, there is a safer alternative: extracting data directly using the “pdftools” package in R.

Other packages exist, such as tabulizer, but tabulizer requires Java and is not currently available on CRAN.

Let us begin by installing and loading the required package:

install.packages("pdftools")
library(pdftools)

Step 1) Define a reusable path variable for the PDF file

mypath  <- "Path_to_pdf/my_pdf.pdf"
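Before reading the file, it can help to confirm that the path actually points to something. The following is a minimal base R sketch; `demo_path` and the temporary directory are stand-ins introduced for illustration, not part of the original workflow.

```r
# Hypothetical location: a temporary directory stands in for "Path_to_pdf"
demo_path <- file.path(tempdir(), "my_pdf.pdf")  # file.path() inserts the separator

# Placeholder file so that the check below passes in this demo
file.create(demo_path)

# Stop early with a readable message if the file is missing
if (!file.exists(demo_path)) {
  stop("PDF not found: ", demo_path)
}

basename(demo_path)  # file name without the directory part
```

With a real file, the same `file.exists()` check can be applied to `mypath` directly.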

Step 2) Import the pdf file using pdf_text()

Here, pdf_text() reads the PDF and returns a character vector with one element per page, so its length matches the page count of the PDF.

pdfText <- pdf_text(mypath)   # read the text of every page
typeof(pdfText)               # check the data type
[1] "character"
pdfText[2]                    # to show content of the second page
[1] "   Example 4: Automobile Land Speed Records (GR 5-10)\n   In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of\n   Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour\n   (mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across\n   frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American\n   Eagle is trying to break a land speed record of 800 mph. The Federation International de\n   L’Automobile (FIA), the world’s governing body for motor sport and land speed records,\n   recorded the following land speed records. (Retrieved on February 5, 2006, from\n   http://www.landspeed.com/lsrinfo.asp.)\n\n      Speed (mph)         Driver                 Car                          Engine      Date\n\n      407.447             Craig Breedlove        Spirit of America            GE J47      8/5/63\n\n      413.199             Tom Green              Wingfoot Express             WE J46      10/2/64\n\n      434.22              Art Arfons             Green Monster                GE J79      10/5/64\n\n      468.719             Craig Breedlove        Spirit of America            GE J79      10/13/64\n\n      526.277             Craig Breedlove        Spirit of America            GE J79      10/15/65\n\n      536.712             Art Arfons             Green Monster                GE J79      10/27/65\n\n      555.127             Craig Breedlove        Spirit of America, Sonic 1   GE J79      11/2/65\n\n      576.553             Art Arfons             Green Monster                GE J79      11/7/65\n\n      600.601             Craig Breedlove        Spirit of America, Sonic 1   GE J79      11/15/65\n\n      622.407             Gary Gabelich          Blue Flame                   Rocket      10/23/70\n\n      633.468             Richard Noble          Thrust 2                     RR RG 146   10/4/83\n\n      763.035             Andy Green             Thrust SSC                   RR Spey     10/15/97\n\n   Example 5: Distance and Time (GR 8-10)\n   The following data were collected using a car with a water clock set to release a drop in\n   a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were\n   run. Create a data table with an average distance column and an average velocity column,\n   create an average distance-time graph, and draw the best-fit line or curve. Estimate the\n   car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is\n   it going at a constant speed, accelerating, or decelerating? How do you know?\n\n                    Time (drops of water)              Distance (cm)\n                               1                           10,11,9\n                               2                           29, 31, 30\n                               3                           59, 58, 61\n                               4                           102, 100, 98\n                               5                           122, 125, 127\n\n\n\n© 2006 WGBH Educational Foundation. All rights reserved.\n\n                                                        2\n"

Note: pdfText is a character vector, where each element represents the text of one PDF page
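Because each element is one page, ordinary vector tools can tell you which page holds the table you want. The sketch below uses a small hand-written `pages` vector as a stand-in for the output of pdf_text(); the sample text is illustrative only and is not the real PDF content.

```r
# Stand-in for pdf_text() output: one string per page (sample text only)
pages <- c(
  "Example 1: Introduction\nNo table on this page.\n",
  "Example 4: Land Speed Records\n   Speed (mph)   Driver\n   407.447   Craig Breedlove\n",
  "Example 6: Closing notes\n"
)

length(pages)   # number of pages

# Indices of the pages that contain the table header text
table_pages <- grep("Speed (mph)", pages, fixed = TRUE)
table_pages

nchar(pages)    # number of characters on each page
```

Using `fixed = TRUE` treats the parentheses literally, so the header text does not need regex escaping.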

Step 3) Process each page into lines

But why split PDF text into lines?

Splitting the text into lines lets you process each line separately, which makes it easier to:

  • Identify headers or keywords

  • Locate patterns (e.g., names, dates, amounts)

  • Extract tabular data and keep the extracted output readable

  • Clean and transform the data

  • Loop through each line if necessary

Note: This step can be skipped if you only need simple text analysis.
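As an illustration of the points above, here is a minimal sketch that finds a header line and lines containing dates in a vector of lines. The `page_lines` vector and the date pattern are assumptions made for this example; a real page would come from splitting pdf_text() output.

```r
# Sample lines standing in for one split page (illustrative only)
page_lines <- c(
  "   Example 4: Automobile Land Speed Records (GR 5-10)",
  "      Speed (mph)         Driver                 Date",
  "      407.447             Craig Breedlove        8/5/63",
  "",
  "      413.199             Tom Green              10/2/64"
)

# Identify the header line by its first column label
header_idx <- grep("^\\s*Speed \\(mph\\)", page_lines)

# Locate lines that contain a date written as m/d/yy (assumed format)
date_pattern <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}"
has_date <- grepl(date_pattern, page_lines)
dates    <- regmatches(page_lines, regexpr(date_pattern, page_lines))

# Keep everything from the header to the last line with a date
table_lines <- page_lines[header_idx:max(which(has_date))]
table_lines
```

Locating the table by pattern rather than by fixed line numbers can make the script more tolerant of small layout changes between files.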

cleaned_pages <- lapply(pdfText, function(page) {   # Apply the function to each element (i.e., each page) of pdfText
  text_lines <- strsplit(page, "\n")[[1]]           # Split the page text into lines at each "\n"
  return(text_lines)
})
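Since strsplit() is itself vectorized, the same per-page split can also be written without lapply(). The sketch below compares the two forms on a small sample vector (`pages` is illustrative, not the real PDF text).

```r
# Sample pages standing in for pdf_text() output
pages <- c("line 1\nline 2\n", "only line\n")

# Explicit loop over pages, as in the step above
by_loop <- lapply(pages, function(page) strsplit(page, "\n")[[1]])

# Vectorized form: strsplit() returns one character vector per input element
by_vector <- strsplit(pages, "\n")

identical(by_loop, by_vector)  # compare the two results
```

The lapply() form is still the better fit when each page needs extra per-page cleaning inside the function.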

typeof(cleaned_pages)
[1] "list"
# cleaned_text <- unlist(cleaned_pages)  # Flatten into a single vector
cleaned_text <- cleaned_pages[[2]]    # Extract second page for demo
cleaned_text
 [1] "   Example 4: Automobile Land Speed Records (GR 5-10)"                                             
 [2] "   In the first recorded automobile race in 1898, Count Gaston de Chasseloup-Laubat of"            
 [3] "   Paris, France, drove 1 kilometer in 57 seconds for an average speed of 39.2 miles per hour"     
 [4] "   (mph) or 63.1 kilometers per hour (kph). In 1904, Henry Ford drove his Ford Arrow across"       
 [5] "   frozen Lake St. Clair, MI, at an average speed of 91.4 mph. Now, the North American"            
 [6] "   Eagle is trying to break a land speed record of 800 mph. The Federation International de"       
 [7] "   L’Automobile (FIA), the world’s governing body for motor sport and land speed records,"         
 [8] "   recorded the following land speed records. (Retrieved on February 5, 2006, from"                
 [9] "   http://www.landspeed.com/lsrinfo.asp.)"                                                         
[10] ""                                                                                                  
[11] "      Speed (mph)         Driver                 Car                          Engine      Date"    
[12] ""                                                                                                  
[13] "      407.447             Craig Breedlove        Spirit of America            GE J47      8/5/63"  
[14] ""                                                                                                  
[15] "      413.199             Tom Green              Wingfoot Express             WE J46      10/2/64" 
[16] ""                                                                                                  
[17] "      434.22              Art Arfons             Green Monster                GE J79      10/5/64" 
[18] ""                                                                                                  
[19] "      468.719             Craig Breedlove        Spirit of America            GE J79      10/13/64"
[20] ""                                                                                                  
[21] "      526.277             Craig Breedlove        Spirit of America            GE J79      10/15/65"
[22] ""                                                                                                  
[23] "      536.712             Art Arfons             Green Monster                GE J79      10/27/65"
[24] ""                                                                                                  
[25] "      555.127             Craig Breedlove        Spirit of America, Sonic 1   GE J79      11/2/65" 
[26] ""                                                                                                  
[27] "      576.553             Art Arfons             Green Monster                GE J79      11/7/65" 
[28] ""                                                                                                  
[29] "      600.601             Craig Breedlove        Spirit of America, Sonic 1   GE J79      11/15/65"
[30] ""                                                                                                  
[31] "      622.407             Gary Gabelich          Blue Flame                   Rocket      10/23/70"
[32] ""                                                                                                  
[33] "      633.468             Richard Noble          Thrust 2                     RR RG 146   10/4/83" 
[34] ""                                                                                                  
[35] "      763.035             Andy Green             Thrust SSC                   RR Spey     10/15/97"
[36] ""                                                                                                  
[37] "   Example 5: Distance and Time (GR 8-10)"                                                         
[38] "   The following data were collected using a car with a water clock set to release a drop in"      
[39] "   a unit of time and a meter stick. The car rolled down an inclined plane. Three trials were"     
[40] "   run. Create a data table with an average distance column and an average velocity column,"       
[41] "   create an average distance-time graph, and draw the best-fit line or curve. Estimate the"       
[42] "   car’s distance traveled and velocity at six drops of water. Describe the motion of the car. Is" 
[43] "   it going at a constant speed, accelerating, or decelerating? How do you know?"                  
[44] ""                                                                                                  
[45] "                    Time (drops of water)              Distance (cm)"                              
[46] "                               1                           10,11,9"                                
[47] "                               2                           29, 31, 30"                             
[48] "                               3                           59, 58, 61"                             
[49] "                               4                           102, 100, 98"                           
[50] "                               5                           122, 125, 127"                          
[51] ""                                                                                                  
[52] ""                                                                                                  
[53] ""                                                                                                  
[54] "© 2006 WGBH Educational Foundation. All rights reserved."                                          
[55] ""                                                                                                  
[56] "                                                        2"                                         

cleaned_text <- cleaned_text[11:35]   # Select the lines that make up the table (11 to 35)

cleaned_text <- trimws(cleaned_text)  # Remove leading/trailing spaces

cleaned_text <- cleaned_text[cleaned_text != ""]  # Remove blank lines

data_split <- strsplit(cleaned_text, "\\s{2,}") # Split rows into columns based on multiple spaces

Note: The processing steps may vary depending on your specific requirements
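One check that is often worth adding here: every row should split into the same number of fields, because rbind() recycles shorter rows (with a warning) instead of failing. A minimal sketch, using a few rows from the table above as sample input (`rows`, `parts`, and `n_cols` are names introduced for this example):

```r
# A few already-trimmed table rows used as sample input
rows <- c(
  "Speed (mph)         Driver                 Car                          Engine      Date",
  "407.447             Craig Breedlove        Spirit of America            GE J47      8/5/63",
  "555.127             Craig Breedlove        Spirit of America, Sonic 1   GE J79      11/2/65",
  "633.468             Richard Noble          Thrust 2                     RR RG 146   10/4/83"
)

# Split on runs of two or more whitespace characters
parts <- strsplit(rows, "\\s{2,}")

# Number of fields found in each row
n_cols <- lengths(parts)
n_cols

# Halt if the rows disagree, before they are bound into a matrix
stopifnot(length(unique(n_cols)) == 1)
```

Rows that fail this check usually contain a cell with a double space inside it or two columns separated by a single space, and need manual handling.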

Step 4) Convert lines to data frame

df <- as.data.frame(do.call(rbind, data_split), stringsAsFactors = FALSE) # Convert to data frame

# Use the first row as column names, then drop it and reset the row names
colnames(df) <- unlist(df[1, ])
df <- df[-1, ]
rownames(df) <- NULL

DT::datatable(df)   # Interactive table; requires the DT package
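After the split, every column is still text. The sketch below shows one possible way to convert the speed and date columns, using three rows from the table as sample data. It assumes all records fall in the 1900s, which holds for this table; the assumption matters because R's `%y` format reads two-digit years 00 to 68 as 20xx.

```r
# Three sample rows with the same column names as the extracted table
df <- data.frame(
  `Speed (mph)` = c("407.447", "622.407", "763.035"),
  Date          = c("8/5/63", "10/23/70", "10/15/97"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Text to numeric
df[["Speed (mph)"]] <- as.numeric(df[["Speed (mph)"]])

# Assumption: every year is 19xx, so expand "63" to "1963" before parsing
full_year <- sub("(\\d{2})$", "19\\1", df$Date)
df$Date   <- as.Date(full_year, format = "%m/%d/%Y")

str(df)
```

For data that spans two centuries, the year expansion would need a rule that fits that data instead of a fixed "19" prefix.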

Additional functions to understand PDF Structure

# Dimensions of each page
pdf_pagesize(mypath)

# Metadata of the PDF file
pdf_info(mypath)

# Fonts used in the PDF
pdf_fonts(mypath)

Conclusion

The “pdftools” package in R is a versatile solution for extracting text and tables from PDFs, offering benefits like ease of use and fast text extraction. However, it has limitations when handling complex table structures or poorly formatted PDFs, often requiring additional cleaning or complementary tools. Despite these challenges, pdftools remains a valuable tool for turning static documents into structured data, enhancing data workflows in various domains.